R Markdown provides an authoring framework for data science. You can use a single R Markdown file to both:
- save and execute code
- generate high quality reports that can be shared with an audience
R Markdown documents are fully reproducible and support dozens of static and dynamic output formats.
Code and comments live together in a more readable format, the files work well with version control repositories (like git), you can test code in chunks (clean environment!), and it is a whole lot more convenient than going back and forth between R and Word (especially when someone critiques your figures!).
You can knit to HTML, PDF, or Word (probably others, too).
Some examples of things in Markdown:
You can easily add images and gifs!!
summary(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
qqplot(iris$Petal.Length,iris$Petal.Width)
(This is a good time to learn about tab complete!)
- include = FALSE: runs the chunk, but doesn’t show results or code.
- echo = FALSE: shows the results, but not the code.
- message = FALSE: suppresses any messages.
- warning = FALSE: suppresses warnings.
- fig.cap = "blah blah blah": easily add a caption to your figures.
You can also set global options (they apply to all chunks) using knitr,
e.g. knitr::opts_chunk$set(echo=TRUE)
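As a minimal sketch, several of the chunk options described here could be set globally in one call near the top of a document. The particular combination of values is only an illustration, not a recommended default.

```r
# Set global chunk options; the chosen values are illustrative assumptions
library(knitr)

opts_chunk$set(
  echo = FALSE,     # show results, but not the code
  message = FALSE,  # suppress messages
  warning = FALSE   # suppress warnings
)

# Inspect the current global default for one option
echo_default <- opts_chunk$get("echo")
echo_default
```

Options given in an individual chunk header still override these global defaults for that chunk.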
Chunks can run code in bash, perl, python, R, etc.
ls *.gif
## cats.gif
You can make tables (with knitr or pander)
require(pander)
pander(head(iris), caption = "A table made with pander")
| Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
|---|---|---|---|---|
| 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 4.9 | 3 | 1.4 | 0.2 | setosa |
| 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 5 | 3.6 | 1.4 | 0.2 | setosa |
| 5.4 | 3.9 | 1.7 | 0.4 | setosa |
You can also make block quotes, or text that is bold or italic, etc.
Check out the R Markdown cheatsheet.
tidyverse is really just a collection of packages. There are core packages, which you’ll likely use, and then other, more specific packages.
install.packages("tidyverse") will install all of the ~20 packages
library(tidyverse) attaches the core packages only:
Data goes into data frames
ggplot2 for making figures…it’s a tidyverse package!!
readr (utils::write.csv(row.names = FALSE) = readr::write_csv())
tibble
A less clunky version of data frames (also easier to build dummy data when you need help!)
tdf = tibble(x = 1:1e4,
             y = rnorm(1e4)) # same result as the older data_frame(), which is now deprecated
tdf
%>%
On a Mac, type shift + cmd + m
On a PC, type shift + ctrl + m
A pipe just sends the output of the expression on its left-hand side to the first argument of the function on its right-hand side.
With pipes:
sum(1:10) %>%
sqrt()
## [1] 7.416198
Without pipes:
sqrt(sum(1:10))
## [1] 7.416198
or
x = sum(1:10)
sqrt(x)
## [1] 7.416198
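To make the rule concrete, here is a small sketch showing that x %>% f(y) is evaluated as f(x, y). It loads magrittr directly (the pipe is also available after library(tidyverse)); the vector named values is made up for illustration.

```r
library(magrittr)

values <- c(1.234, 5.678)  # illustrative data, not from the workshop files

# The left-hand side becomes the FIRST argument of the function on the right
piped  <- values %>% round(1)
nested <- round(values, 1)
identical(piped, nested)

# Pipes chain left to right: take values, sum them, then take the square root
chained <- values %>% sum() %>% sqrt()
identical(chained, sqrt(sum(values)))
```

Both comparisons should come back TRUE, since the piped and nested forms are the same calls written in a different order.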
dplyr
dplyr is used to manipulate data (in data frames…).
Five core functions:
- filter
- select
- arrange
- group_by
- summarise
…there’s a bunch more, too.
I got some pigeon racing data from the internet. It’s actually a mess, so to fix some of it real quick, let’s select only the few variables we want.
pg = read_csv("pigeon-racing.csv")
Let’s use names() to quickly see the names of the columns
names(pg)
## [1] "Pos" "Breeder" "Pigeon" "Name" "Color" "Sex"
## [7] "Ent" "Arrival" "Speed" "To Win" "Eligible"
pigeon = select(pg, Breeder, Pigeon, Color, Sex, Speed)
pigeon
Let’s look at only fast pigeons with filter
filter(pigeon, Speed > 150, Sex == "H")
With base R that’s accomplished with…
pigeon[pigeon$Speed > 150 & pigeon$Sex == "H", ]
Note that the dplyr version is less verbose, and doesn’t require remembering which side of the comma you’re on. Adding additional steps will also be simpler with the tidy format.
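If you don’t have the pigeon file handy, the same comparison can be sketched on the built-in iris data. The column names and thresholds here are stand-ins chosen for illustration, not part of the pigeon example.

```r
library(dplyr)

# dplyr: conditions separated by commas are combined with AND
tidy <- filter(iris, Sepal.Length > 7, Species == "virginica")

# base R: the same condition goes before the comma, inside [ ]
base <- iris[iris$Sepal.Length > 7 & iris$Species == "virginica", ]

# Both approaches keep the same rows
nrow(tidy) == nrow(base)
identical(tidy$Sepal.Length, base$Sepal.Length)
```

One difference worth knowing: when the condition evaluates to NA, filter drops that row, while base subsetting returns a row of NAs. iris has no missing values, so the two agree here.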
Let’s look at only female pigeons, and then see which breeder had the fastest pigeons. We can do this by adding group_by and summarise. And we’re going to use pipes!
pigeon %>%
  filter(Sex == "H") %>%
  group_by(Breeder) %>%
  summarise(breed.speed = mean(Speed), entries = n()) %>%
  arrange(desc(breed.speed))
And, of course, if we wanted, we could have done the initial select all within one pipeline. Plus it works nicely with ggplot2.
pg %>%
  select(Breeder, Pigeon, Color, Sex, Speed) %>%
  filter(Sex == "H") %>%
  group_by(Breeder) %>%
  summarise(breed.speed = mean(Speed), entries = n()) %>%
  arrange(desc(breed.speed)) %>%
  ggplot(aes(x = entries, y = breed.speed, color = Breeder)) +
  geom_point() +
  labs(x = "Number of Pigeons a Breeder Entered", y = "Mean Speed of a Breeder's Pigeons") +
  theme_classic() +
  theme(legend.position = "none")
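The filter, group_by, summarise, arrange pattern used on the pigeon data can also be tried on a built-in dataset. This is a rough sketch on iris, where Species plays the role of Breeder; the cutoff of 5 and the column name mean.petal are arbitrary choices for illustration.

```r
library(dplyr)

by_species <- iris %>%
  filter(Sepal.Length > 5) %>%          # keep only some rows
  group_by(Species) %>%                 # one group per species
  summarise(mean.petal = mean(Petal.Length),  # one summary row per group
            entries = n()) %>%          # n() counts the rows in each group
  arrange(desc(mean.petal))             # largest mean first

by_species
```

The result has one row per species, with the groups ordered from the longest mean petal to the shortest.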
tidyr
tidyr can gather to make wide tables long, and spread to make long tables wide. You’re most likely to use gather.
religion = read_csv("religion.csv")
## Parsed with column specification:
## cols(
## religion = col_character(),
## less.than.30k = col_double(),
## more30less50 = col_double(),
## more50less100 = col_double(),
## more100 = col_double(),
## sampleSize = col_number()
## )
religion
There are really 3 variables: religion, income, and frequency (sample size, too)
religion %>%
  gather(income, frequency, -religion, -sampleSize) %>%
  arrange(religion)
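To see what gather does without the religion file, here is a minimal sketch on a tiny made-up table. The values and the column names low and high are invented for illustration. It also shows pivot_longer, which tidyr 1.0.0 and later offer as the newer replacement for gather.

```r
library(tidyr)
library(tibble)

# A small wide table: one column per income bracket (made-up numbers)
wide <- tibble(
  religion = c("A", "B"),
  low      = c(10, 20),
  high     = c(30, 40)
)

# gather(data, key, value, columns): every column except religion is
# stacked into an income column (old column names) and a frequency column (values)
long <- gather(wide, income, frequency, -religion)
long

# The same reshaping with the newer pivot_longer
long2 <- pivot_longer(wide, -religion, names_to = "income", values_to = "frequency")
long2
```

Both results have four rows (two religions times two income brackets) and the columns religion, income, and frequency; only the row order differs.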
An example with real data we’ve all received before.
tricho = read_csv("trichorainfallpollinators.csv")
## Parsed with column specification:
## cols(
## num = col_integer(),
## treat = col_character(),
## date = col_character(),
## time = col_time(format = ""),
## num_flow = col_integer(),
## num_lgb = col_integer(),
## num_stripey = col_integer(),
## num_bomb = col_integer(),
## num_syrph = col_integer(),
## num_tinyblackbee = col_integer(),
## num_other = col_integer()
## )
tricho
tricho %>%
  gather(species, count, -num, -treat, -date, -time, -num_flow)
Now we can do useful things with it, like find the visit rate
tricho_tidy = tricho %>%
  gather(species, count, -num, -treat, -date, -time, -num_flow) %>%
  mutate(observation = paste(num, treat, date)) %>%
  group_by(observation, num_flow) %>%
  summarise(visits = sum(count)) %>%
  mutate(visit.rate = (visits / num_flow) / 10)
tricho_tidy
require(plotly)
require(viridis)
ggplotly(
  ggplot(data = tricho_tidy, aes(x = num_flow, y = visit.rate, color = visits, text = paste("Observation:", observation))) +
    geom_point() +
    scale_color_viridis() +
    labs(x = "Number of Flowers", y = "Visits Per Flower Per Minute", color = "Raw Flowers Visited") +
    theme_classic() +
    theme(legend.position = "bottom")
)